A feature selection Bayesian approach for a clustering genetic algorithm
Abstract
Feature selection is an important task in clustering problems. Some features help to reveal useful clusters, whereas others hinder the clustering process, so a well-chosen feature subset can yield better clusters. Feature selection also reduces the dimensionality of the dataset, which improves the efficiency of the clustering method. This work describes a Bayesian feature selection approach for a Clustering Genetic Algorithm (CGA). The general method consists of four steps: (i) apply the CGA to a sample of objects selected from the complete dataset; (ii) treat the resulting clusters as distinct classes, which can be modeled by Bayesian networks; (iii) generate a Bayesian network and use the Markov blanket of the class variable to select the feature subset; (iv) apply the CGA to the complete dataset, now restricted to the selected features. Initially, we are mainly interested in evaluating whether feature selection makes sense in the context of the CGA, which can find the best clustering of a dataset according to the Average Silhouette Width criterion. Our first investigation therefore assumes an ideal situation in which the CGA has actually found the right clustering in step (i), so the Bayesian networks are generated not from a sample but from the complete dataset, correctly clustered/classified. In this way we can better evaluate whether the proposed hybrid method is appropriate, i.e. whether the features selected by means of Bayesian networks are suitable for the CGA. To this end, we performed simulations on three datasets that are benchmarks for data mining methods: Wisconsin Breast Cancer, Mushroom and Congressional Voting Records. The simulations on the datasets formed by the selected features gave better results than those on the complete datasets. Thus, we believe that the proposed method is very promising.
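Two ingredients named in the abstract, the Average Silhouette Width criterion and the Markov blanket of the class variable, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the Euclidean distance, the toy points and the toy network structure are all assumptions made for illustration. The benchmark datasets are largely categorical and would need a different dissimilarity measure, and the paper's own Bayesian network learning procedure is not reproduced here.

```python
from collections import defaultdict
from math import dist  # Euclidean distance (Python 3.8+); an assumed metric


def average_silhouette_width(points, labels):
    """Mean over all objects of s(i) = (b(i) - a(i)) / max(a(i), b(i)).

    a(i): mean distance from object i to the other members of its cluster.
    b(i): smallest mean distance from object i to the members of another cluster.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:  # identical points contribute s(i) = 0
            total += (b - a) / max(a, b)
    return total / len(points)


def markov_blanket(parents, target):
    """Parents, children and children's other parents of `target` in a DAG.

    `parents` maps every node to the set of its parent nodes.
    """
    children = {node for node, ps in parents.items() if target in ps}
    spouses = {p for child in children for p in parents[child]}
    return (set(parents.get(target, ())) | children | spouses) - {target}


# Toy data (hypothetical): two well-separated groups of points.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_asw = average_silhouette_width(toy_points, [0, 0, 1, 1])
bad_asw = average_silhouette_width(toy_points, [0, 1, 0, 1])

# Toy network (hypothetical): f5 -> class -> f1, class -> f2 <- f3 <- f4.
toy_network = {
    "class": {"f5"},
    "f1": {"class"},
    "f2": {"class", "f3"},
    "f3": {"f4"},
    "f4": set(),
    "f5": set(),
}
selected_features = markov_blanket(toy_network, "class")
```

In this toy setting the natural grouping scores a higher silhouette than the mixed one, and `f4` falls outside the blanket of `class`, so it would be the feature dropped before rerunning the clustering.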
Transactions on Information and Communications Technologies vol 29, © 2003 WIT Press, www.witpress.com, ISSN 1743-3517
Similar resources
Improvement of effort estimation accuracy in software projects using a feature selection approach
In recent years, utilization of feature selection techniques has become an essential requirement for processing and model construction in different scientific areas. In the field of software project effort estimation, the need to apply dimensionality reduction and feature selection methods has become an inevitable demand. The high volumes of data, costs, and time necessary for gathering data, ...
Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach
Feature selection can significantly be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not have decent performance in these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...
Feature selection using genetic algorithm for breast cancer diagnosis: experiment on three different datasets
Objective(s): This study addresses feature selection for breast cancer diagnosis. The present process uses a wrapper approach using GA-based on feature selection and PS-classifier. The results of experiment show that the proposed model is comparable to the other models on Wisconsin breast cancer datasets. Materials and Methods: To evaluate effectiveness of proposed feature selection method, we ...
Data Clustring Using A New CGA(Chaotic-Generic Algorithm) Approach
Clustering is the process of dividing a set of input data into a number of subgroups. The members of each subgroup are similar to each other but different from members of other subgroups. The genetic algorithm has enjoyed many applications in clustering data. One of these applications is the clustering of images. The problem with the earlier methods used in clustering images was in selecting in...
A Parallel Genetic Algorithm Based Method for Feature Subset Selection in Intrusion Detection Systems
Intrusion detection systems are designed to provide security in computer networks, so that if the attacker crosses other security devices, they can detect and prevent the attack process. One of the most essential challenges in designing these systems is the so called curse of dimensionality. Therefore, in order to obtain satisfactory performance in these systems we have to take advantage of app...
Publication date: 2003